setwd("~/workshops/lessons/intro-to-r/ggplot/")

If you don’t have the data from yesterday’s workshop, load it.

metadata <- read.csv("http://www.datacarpentry.org/R-genomics/data/Ecoli_metadata.csv")
head(metadata)
> #     sample generation   clade strain     cit       run genome_size
> # 1   REL606          0    <NA> REL606 unknown                  4.62
> # 2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
> # 3   ZDB409       5000 unknown REL606 unknown SRR098281        4.60
> # 4   ZDB429      10000      UC REL606 unknown SRR098282        4.59
> # 5   ZDB446      15000      UC REL606 unknown SRR098283        4.66
> # 6   ZDB458      20000 (C1,C2) REL606 unknown SRR098284        4.63

We are studying a population of Escherichia coli (designated Ara-3), which were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations. This metadata describes information on the Ara-3 clones and the columns represent:

Column Description
sample clone name
generation generation when sample frozen
clade based on parsimony-based tree
strain ancestral strain
cit citrate-using mutant status
run Sequence read archive sample ID
genome_size size in Mbp (made up data for this lesson)
genome_size <- metadata$genome_size

Scatterplot

plot(genome_size)

plot(genome_size, pch = 8)

plot(genome_size, pch = 8, main = "Scatter plot of genome sizes")

Histogram

hist(genome_size)

Boxplot

boxplot(genome_size ~ cit, metadata)

boxplot(genome_size ~ cit, metadata, col = c("pink", "purple", "darkgrey"), 
    main = "average expression difference between celltypes", ylab = "Expression")

More advanced figures (ggplot2)

# install.packages('ggplot2')
library(ggplot2)
ggplot(metadata)

ggplot(metadata) + geom_point()
ggplot(metadata) + geom_point(aes(x = sample, y = genome_size))

ggplot(metadata) + geom_point(aes(x = sample, y = genome_size, color = generation, 
    shape = cit), size = rel(3)) + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1))

Histogram

  • To plot a histogram, we need geom_bar which requires a statistical transformation.
  • Some plot types (such as scatterplots) do not require a transformation - each point is plotted by x and y
  • Plots such as boxplot, histograms, prediction lines, etc. need to transformed
ggplot(metadata) + geom_bar(aes(x = genome_size))

Let’s specify a binwidth.

ggplot(metadata) + geom_bar(aes(x = genome_size), stat = "bin", binwidth = 0.05)

Exercise

Adjust the binwidth for this graph and see how it changes the plot.

Boxplot

  • Let’s try plotting a boxplot similar to what we had using base R.
  • We can add some additional layers: plot title, axis labels
ggplot(metadata) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) + 
    ggtitle("Boxplot of genome size by citrate mutant type") + xlab("citrate mutant") + 
    ylab("genome size") + theme(panel.grid.major = element_line(size = 0.5, 
    color = "grey"), axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_text(size = rel(1.5)), 
    axis.text = element_text(size = rel(1.25)))

Writing figures to files

pdf("figure/boxplot.pdf")

ggplot(metadata) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) + 
    ggtitle("Boxplot of genome size by citrate mutant type") + xlab("citrate mutant") + 
    ylab("genome size") + theme(panel.grid.major = element_line(size = 0.5, 
    color = "grey"), axis.text.x = element_text(angle = 45, hjust = 1), axis.title = element_text(size = rel(1.5)), 
    axis.text = element_text(size = rel(1.25)))

dev.off()

More with GGPLOP2

Let’s start by loading our recombination on human genetic diversity data Dataset_S1.txt. If we downloaded to our machine we can load from file:

d <- read.csv("data/Dataset_S1.txt")

We can also read from the web.

d <- read.csv("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/Dataset_S1.txt")
# fixing the colnames to 'percent.GC' from X.GC
colnames(d)[12] <- "percent.GC"
# diversity estimate scaled
d$diversity <- d$Pi/(10 * 1000)  # rescale, removing 10x and making per bp
summary(d$diversity)
> #      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
> # 0.0000000 0.0005577 0.0010420 0.0012390 0.0016880 0.0265300
# position -midpoint b/t start and end
d$position <- (d$end + d$start)/2
# add logical for whether window is in a centromeric region
d$cent <- d$start >= 25800000 & d$end <= 29700000

We are ready to plot:

ggplot(d) + geom_point(aes(x = position, y = diversity))

Challenge

We use + xlab('label') and + ylab('label') for adding lables to the axes. Add a x label of “Chromosome position (basepairs)” to the x axes and “Nucleotide diversity” to the y.

Answer:

ggplot(d) + geom_point(aes(x = position, y = diversity)) + xlab("chromosome position (basepairs)") + 
    ylab("nucleotide diversity")

  • You can also set the limits for continuous axes using the function scale_x_continuous(limits=c(start, end)) where start and end are the start and end of the axes (and scale_y_continuous() for the y axis).
  • You can change an axis to a log10-scale using the functions scale_x_log10() and scale_y_log10(). ggplot2 has numerous other scale options for discrete scales, other axes transforms, and color scales; see http://docs.ggplot2.org for more detail.

  • Aes mappings can be added to the ggplot() function
  • This creates the same mapping as above:

ggplot(d, aes(x = position, y = diversity)) + geom_point()

  • Now about the plot, notice the diversity estimates in the middle of graph
  • what is going on?
  • ggplot2 makes it easy to find out what’s going on with EDA
  • Let’s try to map a confounder or explanitory variable to another aesthetic to see if it shows any pattern.
  • Map color aes of our point geometric object to the column cent - this indicates whether the window falls in the centromeric region of this chromosome
ggplot(d) + geom_point(aes(x = position, y = diversity, color = cent))

  • The region with the missing diversity is around the centromeric
  • ggplot2 well chosen defaults that allows you to quickly create/adjust plots without fussing & then go back to change if needed
    • for data like logial or factors types discrete color palettes are used
    • numeric data gets continuous color palettes
  • Exploratory data analysis is iterative and interactive process
  • one problem with plot above is how much overplotting is occurring - data oversaturation below 0.5
  • To alleviate overplotting let’s try to make the points transparent (alpha)
  • regions with multiple overlapping points appear dark
ggplot(d) + geom_point(aes(x = position, y = diversity), alpha = 0.01)

* Notice we set the alpha outside the aes() function * This is b/c we aren’t mapping to a variable in the data frame, but giving it a fixed value for all data points * Beyond showing lack of diversity est. in the centromeric windows, nothing really shows * Part of problem still overplotting * Bigger issue is that windows span 63 megabases and difficult to detect regional patterns with data at this scale

  • Let’s look at diff geom - density of diversity across all positions
  • geom_density() calculates density
ggplot(d) + geom_density(aes(x = diversity), fill = "black")

  • Default behavior ggplot2 uses a line to draw densities, fill=black fill in the lines so clearer
  • We can map a color to a discrete-valued column in our dataframe
  • geom_density will create separate density plots, group data by column mapped to color
ggplot(d) + geom_density(aes(x = diversity, fill = cent), alpha = 0.4)

  • now we can see a pattern
  • diversity is skewed to more extreme values in centromeric regions
  • try mapping same figure without mapping color
  • mappling columns to additional aesthetic attributes can reveal patterns and information in the data might not be apparent in simple plots

Smoothing

ggplot(d, aes(x = depth, y = total.SNPs)) + geom_point() + geom_smooth()

ggplot(d, aes(x = percent.GC, y = depth)) + geom_point() + geom_smooth()

Faceting

Let’s use the motifs data and do a bit of cleaning on it first. Both of these datasets were created using the GenomicRanges tools we will learn about in Chapter 9, from tracks downloaded directly from the UCSC Genome Browser.

mtfs <- read.delim("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_recombrates.txt", 
    header = TRUE)
rpts <- read.delim("https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_repeats.txt", 
    header = TRUE)

Our goal is to merge the column name in the rpts dataframe into the mtfs column, so we know which repeat each motif is contained in (if any). The link between these two datasets are the positions of each motif, identified by the chromosome and motif start position columns chr and motif_start.

# concatenating columns chr+motif_start into a single key string column can
# simplify merging
mtfs$pos <- paste(mtfs$chr, mtfs$motif_start, sep = "-")
rpts$pos <- paste(rpts$chr, rpts$motif_start, sep = "-")
# use match() to find where each of the mtfs$pos keys occur in the rpts$pos.
# We’ll create this indexing vector first before doing the merge
i <- match(mtfs$pos, rpts$pos)
# using this indexing vector we select out elements of rpts$name and merge
# these into mtfs:
mtfs$repeat_name <- rpts$name[i]
# same effect as above using match()’s results to i and use this directly
mtfs$repeat_name <- rpts$name[match(mtfs$pos, rpts$pos)]

Now let’s plot the relationship b/t recombination rate and distance to a motif using the mtfs from above.

p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1)
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
print(p)

Challenge

Use help in R (or Rstudio) ?geom_smooth and try other parameters for span & method.

  • Explore mtfs mtfs$motif column contains two variations of the sequence motif.
unique(mtfs$motif)
> # [1] CCTCCCTGACCAC CCTCCCTAGCCAC
> # Levels: CCTCCCTAGCCAC CCTCCCTGACCAC
  • do these motfis have any effects on local recombination?
  • let’s try grouping and coloring the loess curves by motif sequence
ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1) + geom_smooth(aes(color = motif), 
    method = "loess", se = FALSE, span = 1/10)

  • notice we put the aes(color=motif) in the geom_smooth
  • another way to group in ggplot2 is via faceting or coditioning plotting (subplots)
p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
p <- p + facet_wrap(~motif)
print(p)

  • ggplot2 has two facet methods: facet_wrap() and facet_grid()
  • facet_wrap() takes a factor column and creates a panel for each level and wraps horizontally
  • facet_grid() allows finder control of facets by allowing you to specify the columns to use for vertical and horz facets:
p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/16)
p <- p + facet_grid(repeat_name ~ motif)
print(p)

* shows some of the same patterns. * Motif CCTCCCTAGCCACon the THE1B repeat background has strong effect as does CCTCCCTGACCAC on the L2 repeat background. * you can get a sense of the data tha goes in this plot with:

table(mtfs$repeat_name, mtfs$motif, useNA = "ifany")
> #        
> #         CCTCCCTAGCCAC CCTCCCTGACCAC
> #   L2              457          4110
> #   THE1B          4651             0
> #   <NA>           2908          7924
  • ~ used with facet_wrap() and facet_grid is how we specify model formulas in R
  • help(formula) for more
  • x and y scales are the same across all panels
  • this might obscure patterns
  • use scales = "free_x" or scales = "free_y" or scales = "free" for both
p <- ggplot(mtfs, aes(x = dist, y = recom)) + geom_point(size = 1, color = "grey")
p <- p + geom_smooth(method = "loess", se = FALSE, span = 1/10)
p <- p + facet_wrap(~motif, scales = "free_y")
print(p)

Challenge

Use facets to look at the data but group by chromosome

facet_wrap(~chr)